1. INTRODUCTION


Feature selection is one of the important problems in pattern recognition,
and it has received considerable attention in the recent literature.
Usually, the performance of the recognition system is expressed in terms of
the probability of misrecognition $P_e$. Unfortunately, it is often difficult
to obtain an analytical expression for $P_e$; and even if one can be obtained,
it will usually be too complicated to permit analytical or numerical
computation. Hence, certain probabilistic distance measures (ref. 1), which are
easy to evaluate and manipulate, are used as criteria for the selection of
effective features.

The distance measures that are normally used in practice are listed in table 1. 
Among these distance measures, divergence (refs. 2 to 5) and Bhattacharyya 
distance (refs. 6 and 7) are extensively investigated in the literature. 

These distance measures either provide bounds on the probability of error 
or give intuitive justification for the measure of separability between the 
classes. If the distributions of the patterns in the classes are assumed to
be multivariate normal, i.e., if

$$
p(X \mid \omega_i) \sim N(m_i, \Sigma_i),
$$

closed-form expressions can be derived for the distance measures given in
table 1. The closed-form expressions are listed in table 2.

TABLE 1.- PROBABILISTIC DISTANCE MEASURES

For Mahalanobis distance, the distributions of the patterns in the
classes are normal with means $m_i$ and common covariance matrix $\Sigma$.

TABLE 2.- DISTANCES BETWEEN TWO MULTIVARIATE NORMAL DENSITIES

For feature selection, the distance measures are used as follows. Suppose
that r features are to be selected out of S given features. There are
$\binom{S}{r}$ different combinations of r features. In a two-class case, one
of the criteria given in table 2 is computed for each feature subset as a
measure of its effectiveness, and the feature subset that extremizes the
criterion is selected as the best. In a multiclass case (refs. 6 and 13), the
distance measures are computed for the feature subset between all pairs of
classes; and the maximum of either the minimum distance between class pairs
or the mean value of the distance between class pairs is used as a measure of
the feature subset's effectiveness. Because the complete criterion function
must be computed anew for each feature subset, this procedure is
computationally very inefficient.
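The exhaustive search described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `best_subset`, the additive toy criterion, and the score values are hypothetical and do not come from the paper; any of the distance measures of table 2 could be passed in as `criterion`.

```python
from itertools import combinations

def best_subset(num_features, r, criterion):
    """Return the r-feature subset (tuple of indices) that maximizes criterion."""
    best, best_value = None, float("-inf")
    for subset in combinations(range(num_features), r):
        value = criterion(subset)  # full recomputation for every subset
        if value > best_value:
            best, best_value = subset, value
    return best, best_value

# Hypothetical toy criterion: per-feature separability scores that simply add up.
scores = [0.2, 1.5, 0.7, 1.1, 0.3]
subset, value = best_subset(len(scores), 3, lambda s: sum(scores[j] for j in s))
```

Each call to `criterion` starts from scratch, which is the inefficiency that the recursive relations of sections 2 and 3 are meant to remove.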

The purpose of this paper is to derive recursive relations for the criteria
listed in table 2. That is, expressions are derived for the change in the
criteria when a feature is deleted from the current feature subset.
Expressions are also derived for the change in the criteria when a feature is
added to the current feature subset. A combinatorial algorithm (ref. 14)
is presented; it generates all possible r feature combinations out of S given
features in $\binom{S}{r}$ steps, with a single feature change at each step.
This algorithm and the recursive relations provide an efficient method of
choosing the best feature subset out of all possible feature subsets. The
paper is organized as follows.

In section 2, recursive relations are derived for computing the distance 
measures when a feature is added to the current feature subset. In section 3, 
recursive expressions are developed for the computation of distance measures 
when a feature is deleted from the current feature subset. In section 4, a 
combinatorial algorithm is presented for generating all possible r
combinations of S in $\binom{S}{r}$ steps with a single change at each step.
Matrix relations used in the paper are derived in the appendices.

2. RECURSIVE RELATIONS FOR DISTANCE MEASURES 
WHEN A FEATURE IS ADDED 

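In this section, expressions are developed for recursively updating the
distance measures (presented in section 1) when a feature is added to the
current feature subset. Let the current feature subset contain features
$x_1, x_2, \ldots, x_{r-1}$. The pattern containing these features is
represented as

$$
X_{r-1} = (x_1, x_2, \ldots, x_{r-1})^T \tag{1}
$$

Let $m_{r-1,i}$ and $\Sigma_{r-1,i}$ be the mean and covariance matrix of the
patterns $X_{r-1}$ in class i, i = 1, 2, ..., M, where M is the number of
classes. Let a feature $x_r$ be added to the current feature subset. Then let
the mean and covariance matrix of the pattern $X_r$ in class i be $m_{r,i}$
and $\Sigma_{r,i}$. The $m_{r,i}$ and $m_{r-1,i}$, and the $\Sigma_{r,i}$ and
$\Sigma_{r-1,i}$, are related as follows.

$$
m_{r,i} = \begin{bmatrix} m_{r-1,i} \\ \mu_{r,i} \end{bmatrix} \tag{2}
$$

and

$$
\Sigma_{r,i} = \begin{bmatrix} \Sigma_{r-1,i} & \gamma_{r-1,i} \\ \gamma_{r-1,i}^T & \sigma_{r,i} \end{bmatrix} \tag{3}
$$

Here $\mu_{r,i}$ and $\sigma_{r,i}$ are the mean and variance of the added
feature $x_r$ in class i, and $\gamma_{r-1,i}$ is the vector of covariances
between $x_r$ and the features $x_1, \ldots, x_{r-1}$.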
A careful examination of table 2 shows that all the criteria listed contain
terms such as the determinant of a covariance matrix, the inverse of a
covariance matrix, and the trace of the product of two matrices. In
appendix A, recursive expressions for these component terms are developed. In
the following, these relations are used to develop expressions for the
recursive computation of the distance measures listed in table 2 when a
feature is added to the current feature subset $x_1, x_2, \ldots, x_{r-1}$.


2.1 DIVERGENCE 

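From table 2, the divergence between two Gaussian distributed pattern classes
can be written as

$$
J_r = \tfrac{1}{2}\,\mathrm{tr}\!\left[(\Sigma_{r,1} - \Sigma_{r,2})(\Sigma_{r,2}^{-1} - \Sigma_{r,1}^{-1})\right]
    + \tfrac{1}{2}\,\mathrm{tr}\!\left[(\Sigma_{r,1}^{-1} + \Sigma_{r,2}^{-1})(m_{r,1} - m_{r,2})(m_{r,1} - m_{r,2})^T\right] \tag{4}
$$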
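From equations (A-6) and (A-8), the following is obtained.

$$
\mathrm{tr}\!\left[(\Sigma_{r,1} - \Sigma_{r,2})(\Sigma_{r,2}^{-1} - \Sigma_{r,1}^{-1})\right]
= \mathrm{tr}\!\left[(\Sigma_{r-1,1} - \Sigma_{r-1,2})(\Sigma_{r-1,2}^{-1} - \Sigma_{r-1,1}^{-1})\right]
+ \beta_{r,2}\,\alpha_{1|2} + \beta_{r,1}\,\alpha_{2|1} - 2 \tag{5}
$$

Let

$$
m_{r,12} = m_{r,1} - m_{r,2}
= \begin{bmatrix} m_{r-1,1} \\ \mu_{r,1} \end{bmatrix} - \begin{bmatrix} m_{r-1,2} \\ \mu_{r,2} \end{bmatrix}
= \begin{bmatrix} m_{r-1,12} \\ \mu_{r,12} \end{bmatrix} \tag{6}
$$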

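From equations (6) and (A-1), the following equation is derived:

$$
\begin{aligned}
m_{r,12}^T \Sigma_{r,1}^{-1} m_{r,12}
&= m_{r-1,12}^T \Sigma_{r-1,1}^{-1} m_{r-1,12} + \beta_{r,1}\left(m_{r-1,12}^T \theta_{r-1,1}\right)^2 \\
&\quad - \mu_{r,12}\,\beta_{r,1}\left(m_{r-1,12}^T \theta_{r-1,1}\right) - \mu_{r,12}\,\beta_{r,1}\left(m_{r-1,12}^T \theta_{r-1,1}\right) + \beta_{r,1}\,\mu_{r,12}^2 \\
&= m_{r-1,12}^T \Sigma_{r-1,1}^{-1} m_{r-1,12} + \beta_{r,1}\left(m_{r-1,12}^T \theta_{r-1,1} - \mu_{r,12}\right)^2
\end{aligned} \tag{7}
$$

where $\beta_{r,i}$ and $\theta_{r-1,i}$ are defined in equation (A-2).
Similar to equation (7), equation (8) can be written as follows:

$$
m_{r,12}^T \Sigma_{r,2}^{-1} m_{r,12}
= m_{r-1,12}^T \Sigma_{r-1,2}^{-1} m_{r-1,12} + \beta_{r,2}\left(m_{r-1,12}^T \theta_{r-1,2} - \mu_{r,12}\right)^2 \tag{8}
$$

Combining equations (4), (5), (7), and (8) results in

$$
J_r = J_{r-1} + \tfrac{1}{2}\left(\beta_{r,2}\,\alpha_{1|2} + \beta_{r,1}\,\alpha_{2|1} - 2\right)
+ \tfrac{1}{2}\,\beta_{r,1}\left(m_{r-1,12}^T \theta_{r-1,1} - \mu_{r,12}\right)^2
+ \tfrac{1}{2}\,\beta_{r,2}\left(m_{r-1,12}^T \theta_{r-1,2} - \mu_{r,12}\right)^2 \tag{9}
$$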
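As a numerical sanity check, the recursive relation for the divergence can be compared with a direct evaluation of equation (4). The sketch below assumes NumPy is available; the function names and the class statistics are hypothetical, illustrative values, not data from the paper.

```python
import numpy as np

def divergence(m1, S1, m2, S2):
    """Divergence between N(m1, S1) and N(m2, S2), following equation (4)."""
    d = m1 - m2
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    scatter = 0.5 * np.trace((S1 - S2) @ (S2_inv - S1_inv))
    means = 0.5 * d @ (S1_inv + S2_inv) @ d
    return scatter + means

def added_feature_terms(S_own, S_other, d):
    """Return (beta_own * alpha_{other|own}, beta_own * (m^T theta_own - mu)^2)."""
    A, g, s = S_own[:-1, :-1], S_own[:-1, -1], S_own[-1, -1]
    theta = np.linalg.solve(A, g)        # theta = Sigma_{r-1}^{-1} gamma
    beta = 1.0 / (s - g @ theta)         # reciprocal of the Schur complement
    B, h, t = S_other[:-1, :-1], S_other[:-1, -1], S_other[-1, -1]
    alpha = theta @ B @ theta - 2.0 * (h @ theta) + t
    return beta * alpha, beta * (d[:-1] @ theta - d[-1]) ** 2

# Hypothetical two-class statistics for r = 3 features (illustrative only).
S1 = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
S2 = np.array([[1.0, -0.2, 0.4], [-0.2, 2.0, 0.1], [0.4, 0.1, 1.0]])
m1 = np.array([0.0, 1.0, 2.0])
m2 = np.array([1.0, 0.5, 0.0])
d = m1 - m2

J_prev = divergence(m1[:2], S1[:2, :2], m2[:2], S2[:2, :2])
ba_1, sq_1 = added_feature_terms(S1, S2, d)  # beta_{r,1} alpha_{2|1} and class-1 mean term
ba_2, sq_2 = added_feature_terms(S2, S1, d)  # beta_{r,2} alpha_{1|2} and class-2 mean term
J_recursive = J_prev + 0.5 * (ba_1 + ba_2 - 2.0) + 0.5 * (sq_1 + sq_2)
J_direct = divergence(m1, S1, m2, S2)
```

For brevity the sketch solves a linear system to obtain each theta; in an actual search the inverse of the smaller covariance matrix would presumably be carried along from the previous step.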
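2.2 BHATTACHARYYA DISTANCE

The Bhattacharyya distance between two Gaussian distributed pattern classes
is given by

$$
B_r = \tfrac{1}{8}\, m_{r,12}^T \left[\tfrac{1}{2}(\Sigma_{r,1} + \Sigma_{r,2})\right]^{-1} m_{r,12}
+ \tfrac{1}{2} \ln \frac{\det\!\left[\tfrac{1}{2}(\Sigma_{r,1} + \Sigma_{r,2})\right]}{\left[\det(\Sigma_{r,1})\det(\Sigma_{r,2})\right]^{1/2}} \tag{10}
$$

Let $\Sigma_{r,1+2} = \Sigma_{r,1} + \Sigma_{r,2}$. Equation (11) can be
written similarly to equation (7):

$$
m_{r,12}^T \Sigma_{r,1+2}^{-1} m_{r,12}
= m_{r-1,12}^T \Sigma_{r-1,1+2}^{-1} m_{r-1,12} + \beta_{r,1+2}\left(m_{r-1,12}^T \theta_{r-1,1+2} - \mu_{r,12}\right)^2 \tag{11}
$$

where $\beta_{r,1+2}$ and $\theta_{r-1,1+2}$ are obtained from
$\Sigma_{r,1+2}$ in the same way that $\beta_{r,i}$ and $\theta_{r-1,i}$ are
obtained from $\Sigma_{r,i}$ in equation (A-2).
From equation (A-5), the following equation is obtained:

$$
\tfrac{1}{2} \ln \frac{\det\!\left[\tfrac{1}{2}(\Sigma_{r,1} + \Sigma_{r,2})\right]}{\left[\det(\Sigma_{r,1})\det(\Sigma_{r,2})\right]^{1/2}}
= \tfrac{1}{2} \ln \frac{\det\!\left[\tfrac{1}{2}(\Sigma_{r-1,1} + \Sigma_{r-1,2})\right]}{\left[\det(\Sigma_{r-1,1})\det(\Sigma_{r-1,2})\right]^{1/2}}
+ \tfrac{1}{2} \ln \left\{ \frac{(\beta_{r,1}\,\beta_{r,2})^{1/2}}{2\,\beta_{r,1+2}} \right\} \tag{12}
$$

From equations (10), (11), and (12), the recursive relation for the
Bhattacharyya distance is obtained as

$$
B_r = B_{r-1} + \tfrac{1}{4}\,\beta_{r,1+2}\left(m_{r-1,12}^T \theta_{r-1,1+2} - \mu_{r,12}\right)^2
+ \tfrac{1}{2} \ln \left\{ \frac{(\beta_{r,1}\,\beta_{r,2})^{1/2}}{2\,\beta_{r,1+2}} \right\} \tag{13}
$$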
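The recursive relation for the Bhattacharyya distance can be checked numerically in the same spirit. The sketch below assumes NumPy; the helper names and the class statistics are hypothetical and serve only to compare the recursive update with a direct evaluation of equation (10).

```python
import numpy as np

def bhattacharyya(m1, S1, m2, S2):
    """Bhattacharyya distance between N(m1, S1) and N(m2, S2), as in equation (10)."""
    d = m1 - m2
    S_avg = 0.5 * (S1 + S2)
    quad = d @ np.linalg.solve(S_avg, d) / 8.0
    log_term = 0.5 * np.log(
        np.linalg.det(S_avg) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))
    )
    return quad + log_term

def beta_theta(S):
    """beta and theta of a partitioned covariance matrix (last feature is the added one)."""
    A, g, s = S[:-1, :-1], S[:-1, -1], S[-1, -1]
    theta = np.linalg.solve(A, g)
    return 1.0 / (s - g @ theta), theta

# Hypothetical two-class statistics for r = 3 features (illustrative only).
S1 = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
S2 = np.array([[1.0, -0.2, 0.4], [-0.2, 2.0, 0.1], [0.4, 0.1, 1.0]])
m1 = np.array([0.0, 1.0, 2.0])
m2 = np.array([1.0, 0.5, 0.0])
d = m1 - m2

b1, _ = beta_theta(S1)
b2, _ = beta_theta(S2)
b12, theta12 = beta_theta(S1 + S2)       # quantities derived from Sigma_{r,1+2}

B_prev = bhattacharyya(m1[:2], S1[:2, :2], m2[:2], S2[:2, :2])
B_recursive = (
    B_prev
    + 0.25 * b12 * (d[:-1] @ theta12 - d[-1]) ** 2
    + 0.5 * np.log(np.sqrt(b1 * b2) / (2.0 * b12))
)
B_direct = bhattacharyya(m1, S1, m2, S2)
```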
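2.3 JEFFREYS-MATUSITA DISTANCE

The Jeffreys-Matusita distance between two Gaussian distributed pattern
classes is obtained from table 2:

$$
JM_r = 2\left\{ 1 - \left[ \frac{\left[\det(\Sigma_{r,1})\det(\Sigma_{r,2})\right]^{1/2}}{\det\!\left[\tfrac{1}{2}(\Sigma_{r,1} + \Sigma_{r,2})\right]} \right]^{1/2}
\exp\!\left[ -\tfrac{1}{8}\, m_{r,12}^T \left[\tfrac{1}{2}(\Sigma_{r,1} + \Sigma_{r,2})\right]^{-1} m_{r,12} \right] \right\} \tag{14}
$$

Let

$$
\Sigma_{r,12} = \Sigma_{r,1} + \Sigma_{r,2} \tag{15}
$$

which is the matrix written $\Sigma_{r,1+2}$ in section 2.2.
Similar to equation (7), the following equation is derived:

$$
m_{r,12}^T \Sigma_{r,12}^{-1} m_{r,12}
= m_{r-1,12}^T \Sigma_{r-1,12}^{-1} m_{r-1,12} + \beta_{r,12}\left(m_{r-1,12}^T \theta_{r-1,12} - \mu_{r,12}\right)^2 \tag{16}
$$

Let

$$
C = \left[ \frac{2\,\beta_{r,12}}{(\beta_{r,1}\,\beta_{r,2})^{1/2}} \right]^{1/2}
\exp\!\left[ -\tfrac{1}{4}\,\beta_{r,12}\left(m_{r-1,12}^T \theta_{r-1,12} - \mu_{r,12}\right)^2 \right] \tag{17}
$$

From equations (A-5), (14), (16), and (17), the recursive relation for the
Jeffreys-Matusita distance is obtained as

$$
JM_r = C\, JM_{r-1} + 2(1 - C) \tag{18}
$$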
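The update in equation (18) can be illustrated with the standard identity JM = 2(1 - exp(-B)), where B is the Bhattacharyya distance; under that identity the factor C equals exp(-(B_r - B_{r-1})). The sketch below uses only the standard library; the two Bhattacharyya values are hypothetical.

```python
import math

def jm_from_bhattacharyya(b):
    """Jeffreys-Matusita distance expressed through the Bhattacharyya distance."""
    return 2.0 * (1.0 - math.exp(-b))

def jm_update(jm_prev, c):
    """Recursive update of equation (18)."""
    return c * jm_prev + 2.0 * (1.0 - c)

# Hypothetical Bhattacharyya distances before and after adding a feature.
B_prev, B_curr = 0.40, 0.65
C = math.exp(-(B_curr - B_prev))   # multiplicative change of exp(-B)

JM_recursive = jm_update(jm_from_bhattacharyya(B_prev), C)
JM_direct = jm_from_bhattacharyya(B_curr)
```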


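2.4 KULLBACK-LEIBLER NUMBERS

The Kullback-Leibler number between two Gaussian distributed pattern classes
is given by

$$
KL12_r = \tfrac{1}{2} \ln \frac{\det(\Sigma_{r,2})}{\det(\Sigma_{r,1})}
+ \tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma_{r,1}\Sigma_{r,2}^{-1}\right) - \tfrac{r}{2}
+ \tfrac{1}{2}\, m_{r,12}^T \Sigma_{r,2}^{-1} m_{r,12} \tag{19}
$$

From equations (8), (19), (A-5), and (A-6), the recursive relation for the
Kullback-Leibler number is obtained as

$$
KL12_r = KL12_{r-1} + \tfrac{1}{2} \ln \frac{\beta_{r,1}}{\beta_{r,2}}
+ \tfrac{1}{2}\left(\beta_{r,2}\,\alpha_{1|2} - 1\right)
+ \tfrac{1}{2}\,\beta_{r,2}\left(m_{r-1,12}^T \theta_{r-1,2} - \mu_{r,12}\right)^2 \tag{20}
$$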


2.5 MAHALANOBIS DISTANCE 


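In Mahalanobis distance, $\Sigma$ is usually taken as the average of the
covariance matrices of the two pattern classes. Then it is defined as

$$
\Delta_r = m_{r,12}^T \left[\tfrac{1}{2}(\Sigma_{r,1} + \Sigma_{r,2})\right]^{-1} m_{r,12}
= 2\, m_{r,12}^T \Sigma_{r,1+2}^{-1} m_{r,12} \tag{21}
$$

From equations (7) and (21), the recursive expression for the Mahalanobis
distance is obtained as

$$
\Delta_r = \Delta_{r-1} + 2\,\beta_{r,1+2}\left(m_{r-1,12}^T \theta_{r-1,1+2} - \mu_{r,12}\right)^2 \tag{22}
$$

where $\Sigma_{r,1+2}$, $\beta_{r,1+2}$, and $\theta_{r-1,1+2}$ are as in
section 2.2.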


3. RECURSIVE RELATIONS FOR DISTANCE MEASURES WHEN A FEATURE IS DELETED 


Recursive expressions are presented in this section for updating the
distance measures given in section 1 when a feature is deleted from the
current feature subset. Let the current feature subset contain features
$x_1, x_2, \ldots, x_r$. The pattern containing these features is
represented as

$$
X_r = (x_1, \ldots, x_r)^T
$$

Let $m_{r,i}$ and $\Sigma_{r,i}$ be the mean and the covariance matrix of the
pattern $X_r$ in class i. Let the feature $x_r$ be deleted from the current
feature subset. Then let the mean and the covariance matrix of the patterns
$X_{r-1}$ in class i be $m_{r-1,i}$ and $\Sigma_{r-1,i}$. The $m_{r,i}$ and
$m_{r-1,i}$, and the $\Sigma_{r,i}$ and $\Sigma_{r-1,i}$, are related as in
equations (2) and (3). Appendix B presents the derivations of the recursive
relations for the determinant of a covariance matrix, the inverse of a
covariance matrix, and the trace of the product of two matrices when a
feature is deleted from the current feature subset. These relations are
used in the following subsections in deriving expressions for recursively
computing the distance measures (section 1) when a feature is deleted from
the current feature subset.


3.1 DIVERGENCE 

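The divergence between two Gaussian distributed pattern classes is given by
equation (4). From equation (B-13), the following is obtained:

$$
\mathrm{tr}\!\left(\Sigma_{r,1}\Sigma_{r,2}^{-1}\right)
= \mathrm{tr}\!\left(\Sigma_{r-1,1}\Sigma_{r-1,2}^{-1}\right) + \frac{\xi_{r,2}^T \Sigma_{r,1}\, \xi_{r,2}}{\nu_{r,2}} \tag{23}
$$

where $\xi_{r,i} = [\lambda_{r-1,i}^T \;\; \nu_{r,i}]^T$ is the last column of
$\Sigma_{r,i}^{-1}$ as partitioned in equation (B-1), and $\nu_{r,i}$ is its
last element.
From equations (6), (B-1), and (B-8), the following is obtained:

$$
\begin{aligned}
m_{r,12}^T \Sigma_{r,1}^{-1} m_{r,12}
&= m_{r-1,12}^T \Lambda_{r-1,1}\, m_{r-1,12} + \mu_{r,12}\,\lambda_{r-1,1}^T m_{r-1,12} + m_{r-1,12}^T \lambda_{r-1,1}\,\mu_{r,12} + \mu_{r,12}^2\,\nu_{r,1} \\
&= m_{r-1,12}^T \Sigma_{r-1,1}^{-1} m_{r-1,12} + \frac{\left(m_{r-1,12}^T \lambda_{r-1,1}\right)^2}{\nu_{r,1}} + 2\,\mu_{r,12}\, m_{r-1,12}^T \lambda_{r-1,1} + \mu_{r,12}^2\,\nu_{r,1} \\
&= m_{r-1,12}^T \Sigma_{r-1,1}^{-1} m_{r-1,12} + \frac{\left(m_{r,12}^T \xi_{r,1}\right)^2}{\nu_{r,1}}
\end{aligned} \tag{24}
$$

Similar to equations (23) and (24), the following is obtained:

$$
\mathrm{tr}\!\left(\Sigma_{r,2}\Sigma_{r,1}^{-1}\right)
= \mathrm{tr}\!\left(\Sigma_{r-1,2}\Sigma_{r-1,1}^{-1}\right) + \frac{\xi_{r,1}^T \Sigma_{r,2}\, \xi_{r,1}}{\nu_{r,1}} \tag{25}
$$

$$
m_{r,12}^T \Sigma_{r,2}^{-1} m_{r,12}
= m_{r-1,12}^T \Sigma_{r-1,2}^{-1} m_{r-1,12} + \frac{\left(m_{r,12}^T \xi_{r,2}\right)^2}{\nu_{r,2}} \tag{26}
$$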
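From equation (4) and equations (23) to (26), the following recursive
relation is obtained for the computation of divergence.

$$
J_{r-1} = J_r
- \tfrac{1}{2}\left[ \frac{\xi_{r,2}^T \Sigma_{r,1}\, \xi_{r,2}}{\nu_{r,2}} + \frac{\xi_{r,1}^T \Sigma_{r,2}\, \xi_{r,1}}{\nu_{r,1}} \right]
- \tfrac{1}{2}\left[ \frac{\left(m_{r,12}^T \xi_{r,1}\right)^2}{\nu_{r,1}} + \frac{\left(m_{r,12}^T \xi_{r,2}\right)^2}{\nu_{r,2}} \right] + 1 \tag{27}
$$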


3.2 BHATTACHARYYA DISTANCE 


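The Bhattacharyya distance between two Gaussian distributed pattern classes is
given by equation (10). Similar to equation (24), equation (28) is obtained:

$$
m_{r,12}^T \Sigma_{r,1+2}^{-1} m_{r,12}
= m_{r-1,12}^T \Sigma_{r-1,1+2}^{-1} m_{r-1,12} + \frac{\left(m_{r,12}^T \xi_{r,1+2}\right)^2}{\nu_{r,1+2}} \tag{28}
$$

where $\xi_{r,1+2}$ is the last column of $\Sigma_{r,1+2}^{-1}$ and
$\nu_{r,1+2}$ is its last element.
From equations (10), (28), and (B-9), a recursive expression for the
computation of Bhattacharyya distance can be obtained as

$$
B_{r-1} = B_r - \tfrac{1}{4}\,\frac{\left(m_{r,12}^T \xi_{r,1+2}\right)^2}{\nu_{r,1+2}}
- \tfrac{1}{2} \ln \left\{ \frac{(\nu_{r,1}\,\nu_{r,2})^{1/2}}{2\,\nu_{r,1+2}} \right\} \tag{29}
$$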


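3.3 JEFFREYS-MATUSITA DISTANCE

The Jeffreys-Matusita distance between two Gaussian distributed pattern
classes is given by equation (14). Equation (15) is used similarly to
equation (24) to obtain

$$
m_{r,12}^T \Sigma_{r,12}^{-1} m_{r,12}
= m_{r-1,12}^T \Sigma_{r-1,12}^{-1} m_{r-1,12} + \frac{\left(m_{r,12}^T \xi_{r,12}\right)^2}{\nu_{r,12}} \tag{30}
$$

where $\xi_{r,12}$ is the last column of $\Sigma_{r,12}^{-1}$ and
$\nu_{r,12}$ is its last element. Let

$$
C = \left[ \frac{2\,\nu_{r,12}}{(\nu_{r,1}\,\nu_{r,2})^{1/2}} \right]^{1/2}
\exp\!\left[ -\tfrac{1}{4}\,\frac{\left(m_{r,12}^T \xi_{r,12}\right)^2}{\nu_{r,12}} \right] \tag{31}
$$

From equations (15), (B-9), (30), and (31), a recursive expression for the
computation of Jeffreys-Matusita distance when a feature is deleted from the
current feature subset can be written as

$$
JM_{r-1} = 2 + \frac{1}{C}\left[ JM_r - 2 \right] \tag{32}
$$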


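3.4 KULLBACK-LEIBLER NUMBERS

From table 2, the Kullback-Leibler number between two Gaussian distributed
pattern classes is given in equation (19). Equations (B-9), (23), and (26)
in (19) can be used to write a recursive expression for the computation
of the Kullback-Leibler number as follows:

$$
KL12_{r-1} = KL12_r - \tfrac{1}{2} \ln \frac{\nu_{r,1}}{\nu_{r,2}}
- \tfrac{1}{2}\left[ \frac{\xi_{r,2}^T \Sigma_{r,1}\, \xi_{r,2}}{\nu_{r,2}} - 1 \right]
- \tfrac{1}{2}\,\frac{\left(m_{r,12}^T \xi_{r,2}\right)^2}{\nu_{r,2}} \tag{33}
$$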


3.5 MAHALANOBIS DISTANCE

The Mahalanobis distance, taking the covariance matrix in it as the average
of the covariance matrices of the two pattern classes, can be written as

$$
\Delta_r = 2\,(m_{r,1} - m_{r,2})^T\, \Sigma_{r,1+2}^{-1}\, (m_{r,1} - m_{r,2}) \tag{34}
$$

From equations (28) and (34), a recursive relation for the computation of
Mahalanobis distance when a feature is deleted from the current feature
subset is obtained:

$$
\Delta_{r-1} = \Delta_r - \frac{2\left(m_{r,12}^T \xi_{r,1+2}\right)^2}{\nu_{r,1+2}} \tag{35}
$$
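The deletion update for the Mahalanobis distance needs only the last column of the inverse of the summed covariance matrix. The following sketch, assuming NumPy and hypothetical class statistics, compares the update with a direct computation on the reduced feature set.

```python
import numpy as np

# Hypothetical two-class statistics for r = 3 features (illustrative only).
S1 = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
S2 = np.array([[1.0, -0.2, 0.4], [-0.2, 2.0, 0.1], [0.4, 0.1, 1.0]])
m1 = np.array([0.0, 1.0, 2.0])
m2 = np.array([1.0, 0.5, 0.0])

S_sum = S1 + S2                      # Sigma_{r,1+2}
d = m1 - m2                          # m_{r,12}
P = np.linalg.inv(S_sum)             # inverse for the current r features
xi, nu = P[:, -1], P[-1, -1]         # last column and last diagonal element

delta_r = 2.0 * d @ P @ d                          # distance with all r features
delta_prev = delta_r - 2.0 * (d @ xi) ** 2 / nu    # update after deleting feature r

# Direct evaluation on the first r - 1 features, for comparison only.
delta_direct = 2.0 * d[:-1] @ np.linalg.solve(S_sum[:-1, :-1], d[:-1])
```

No matrix of the reduced size has to be inverted for the update itself; the direct evaluation is included only as a check.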


4. A COMBINATORIAL ALGORITHM FOR GENERATING ALL POSSIBLE COMBINATIONS 

This section describes an algorithm for generating all possible r
combinations out of S in $\binom{S}{r}$ steps. At each step, a single change
is made; i.e., one feature is deleted and one is added. The recursive
relations developed in sections 2 and 3, coupled with this algorithm, can be
effectively used to search for a best feature subset of r features out of all
possible $\binom{S}{r}$ feature subsets using probabilistic distance measures
as the criteria.
The initial combination may be any combination in which all the r selected
features are numbered consecutively. In the binary representation, this means
that all the r 1's are in one run in a vector of length S. For example, if
r = 3 and S = 5, one may start with 11100 or 00111. The binary vector is
denoted by A, and its ith component is A(i). Initially, all the components of
A, except those of the last run, are marked. For example, if A = 00111000
(for r = 3 and S = 8), then it is marked as
$\bar{0}\bar{0}\bar{1}\bar{1}\bar{1}000$, where an overbar denotes a marked
component.

If a is a symbol, $a^m$ stands for aa···a (m times). Let i be the highest
index j such that A(j) is marked. A vector T(1), T(2), ···, T(S) of integers
that satisfy the condition |T(j)| < j for j = 1, 2, ···, S is defined.
Initially, T(1) = 0. If the initial combination is $0^p 1^r 0^{S-r-p}$, where
S > r + p, then T(p + r) = -1 and all the rest are immaterial. If the
initial combination is $0^{S-r} 1^r$, then T(S - r) = -1 and all the rest are
immaterial. The changes T must undergo in each combination generation are
described by subroutines α and β as follows.

α: (i) If T(k) = 0, then output A and halt.

(ii) If T(k) > 0, then i ← T(k), output A, and go to step 1 of
the algorithm.

(iii) i ← k - 1. If T(k) > -(k - 1), then T(k - 1) ← T(k).

(iv) Output A and go to step 1 of the algorithm.

β: (i) T(i) ← -(k + 1). If T(k) ≥ 0, then T(k + 1) ← T(k), output A,
and go to step 1 of the algorithm.

(ii) T(k + 1) ← k - 1. If T(k) > -(k - 1), then T(k - 1) ← T(k).

(iii) Output A and go to step 1 of the algorithm.

Now the vector F(0), F(1), ···, F(S) is introduced as follows. If A(m) = 1
and if it is the rightmost element in a run of 1's, then F(m) is the index
of the first 1 of this run. If not, F(m) is immaterial. Let ℓ be the
index of the rightmost 1; that is, ℓ = max{m : A(m) = 1}.
Now an algorithm for generating all possible combinations with a single
change at each step can be described as follows. The initial conditions of
the algorithm are illustrated as follows. Let r = 3, S = 8 with an
initial A = 01110000. Then i = 4.

1. k ← i. If A(i) = 1, go to step 8.

2. j ← F(ℓ).

3. A(i) ← 1, A(j) ← 0, and F(k) ← k. If A(k - 1) = 1 and k > 1, then
F(k) ← F(k - 1). F(ℓ) ← j + 1. If j < ℓ, go to step 5.

4. ℓ ← i. Perform α.

5. If ℓ < S, go to step 7.

6. i ← j. Perform β.

7. i ← ℓ. Perform β.

8. F(i - 1) ← F(i). If ℓ > i, go to step 12.

9. A(i) ← 0, A(S) ← 1, F(S) ← S, ℓ ← S. If i < S - 1, go to step 11.

10. Perform α.

11. i ← S - 1. Perform β.

12. j ← F(ℓ).

13. A(i) ← 0, A(j - 1) ← 1, F(ℓ) ← j - 1. If ℓ < S, go to step 17.

14. If i + 1 < j - 1, go to step 16.

15. Perform α.

16. i ← j - 2. Perform β.

17. i ← ℓ. Perform β.
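The single-change property that the search relies on can be illustrated with a different, well-known construction: the recursive "revolving door" ordering of combinations. The sketch below is not the algorithm of ref. 14 and is not loop-free; it is a swapped-in technique that produces the same kind of sequence, in which consecutive r-subsets differ by deleting one feature and adding another. The function name is hypothetical.

```python
def revolving_door(n, k):
    """List all k-subsets of range(n) so that neighbours differ by one swap."""
    if k == 0:
        return [[]]
    if k == n:
        return [list(range(n))]
    # Subsets that avoid feature n - 1, followed by those that contain it,
    # the latter taken in reversed order so the boundary is also a single swap.
    without_last = revolving_door(n - 1, k)
    with_last = [c + [n - 1] for c in reversed(revolving_door(n - 1, k - 1))]
    return without_last + with_last

subsets = revolving_door(5, 3)

# Feature deleted and feature added at each step of the sequence.
changes = [
    (min(set(a) - set(b)), min(set(b) - set(a)))
    for a, b in zip(subsets, subsets[1:])
]
```

Each entry of `changes` names one deleted and one added feature, which is the information the recursive relations of sections 2 and 3 need at every step.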



5. CONCLUSIONS 

This paper considered probabilistic distance measures as criteria for feature
subset evaluation. The measures discussed are divergence, Bhattacharyya
distance, Jeffreys-Matusita distance, Kullback-Leibler numbers, and
Mahalanobis distance.

The problem of finding the best feature subset is that of evaluating all
possible feature subsets and selecting the one that extremizes the criterion.
Recursive expressions are derived for computing the criteria as a change in
the distance measures, both when a feature is added to the current feature
subset and when a feature is deleted from the current feature subset. A
combinatorial algorithm is presented for generating all possible r feature
combinations from a given set of S features in $\binom{S}{r}$ steps with a
change of a single feature at each step. These recursive expressions and the
combinatorial algorithm provide an efficient way of finding the best feature
subset by exhaustive search, using the probabilistic distance measures as
criteria. These expressions can also be used for finding a suboptimal
feature subset using forward or backward sequential feature selection
methods.
APPENDIX A


RECURSIVE MATRIX RELATIONSHIPS WHEN A FEATURE IS ADDED


In this appendix, recursive relations are derived for the inverse of a
matrix, determinant of a matrix, and trace of the product of two matrices
when a feature $x_r$ is added to the current feature subset
$x_1, \ldots, x_{r-1}$.


A.1 INVERSE OF A COVARIANCE MATRIX


It can be shown that the inverse of equation (3) can be written (ref. 15) as

$$
\Sigma_{r,i}^{-1} = \begin{bmatrix}
\Sigma_{r-1,i}^{-1} + \beta_{r,i}\,\theta_{r-1,i}\,\theta_{r-1,i}^T & -\beta_{r,i}\,\theta_{r-1,i} \\
-\beta_{r,i}\,\theta_{r-1,i}^T & \beta_{r,i}
\end{bmatrix} \tag{A-1}
$$

where

$$
\beta_{r,i} = \frac{1}{\sigma_{r,i} - \gamma_{r-1,i}^T\, \Sigma_{r-1,i}^{-1}\, \gamma_{r-1,i}}
$$

and

$$
\theta_{r-1,i} = \Sigma_{r-1,i}^{-1}\, \gamma_{r-1,i} \tag{A-2}
$$
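A.2 DETERMINANT OF A COVARIANCE MATRIX

Let the matrix $\Sigma_{r,i}$ be partitioned as in equation (3). Consider a
matrix B,

$$
B = \begin{bmatrix} I & 0 \\ -\gamma_{r-1,i}^T\, \Sigma_{r-1,i}^{-1} & 1 \end{bmatrix} \tag{A-3}
$$

The determinant of matrix B is unity. Form a matrix $B\,\Sigma_{r,i}\,B^T$,

$$
B\,\Sigma_{r,i}\,B^T = \begin{bmatrix} \Sigma_{r-1,i} & 0 \\ 0 & 1/\beta_{r,i} \end{bmatrix} \tag{A-4}
$$

where

$$
\frac{1}{\beta_{r,i}} = \sigma_{r,i} - \gamma_{r-1,i}^T\, \Sigma_{r-1,i}^{-1}\, \gamma_{r-1,i}
$$

Taking the determinants on both sides of equation (A-4), one obtains the
following:

$$
\det(\Sigma_{r,i}) = \left(\sigma_{r,i} - \gamma_{r-1,i}^T\, \Sigma_{r-1,i}^{-1}\, \gamma_{r-1,i}\right)\det(\Sigma_{r-1,i})
= \frac{\det(\Sigma_{r-1,i})}{\beta_{r,i}} \tag{A-5}
$$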
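The partitioned inverse of section A.1 and the determinant relation of section A.2 are easy to verify numerically. The sketch below assumes NumPy; the covariance matrix is a hypothetical example, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Hypothetical covariance matrix for r = 3 features; the last row and column
# belong to the feature being added.
S = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
A, g, s = S[:-1, :-1], S[:-1, -1], S[-1, -1]

A_inv = np.linalg.inv(A)          # inverse already known for r - 1 features
theta = A_inv @ g                 # theta = Sigma_{r-1}^{-1} gamma
beta = 1.0 / (s - g @ theta)      # reciprocal of the Schur complement

# Inverse of the enlarged matrix assembled block by block.
S_inv = np.block([
    [A_inv + beta * np.outer(theta, theta), -beta * theta[:, None]],
    [-beta * theta[None, :], np.array([[beta]])],
])

# Determinant of the enlarged matrix from the smaller determinant.
det_S = np.linalg.det(A) / beta
```

Only matrix-vector products with the known inverse are needed, so the cost of the update grows quadratically rather than cubically with the number of features.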

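A.3 TRACE OF THE PRODUCT OF TWO MATRICES

From equations (3) and (A-1), the following is obtained:

$$
\begin{aligned}
\mathrm{tr}\!\left(\Sigma_{r,1}\Sigma_{r,2}^{-1}\right)
&= \mathrm{tr}\!\left(\Sigma_{r-1,1}\Sigma_{r-1,2}^{-1}\right) + \beta_{r,2}\,\theta_{r-1,2}^T \Sigma_{r-1,1}\, \theta_{r-1,2} \\
&\quad - \beta_{r,2}\,\gamma_{r-1,1}^T \theta_{r-1,2} - \beta_{r,2}\,\gamma_{r-1,1}^T \theta_{r-1,2} + \sigma_{r,1}\,\beta_{r,2} \\
&= \mathrm{tr}\!\left(\Sigma_{r-1,1}\Sigma_{r-1,2}^{-1}\right) + \beta_{r,2}\left[\theta_{r-1,2}^T \Sigma_{r-1,1}\, \theta_{r-1,2} - 2\,\gamma_{r-1,1}^T \theta_{r-1,2} + \sigma_{r,1}\right] \\
&= \mathrm{tr}\!\left(\Sigma_{r-1,1}\Sigma_{r-1,2}^{-1}\right) + \beta_{r,2}\,\alpha_{1|2}
\end{aligned} \tag{A-6}
$$

where

$$
\alpha_{1|2} = \theta_{r-1,2}^T \Sigma_{r-1,1}\, \theta_{r-1,2} - 2\,\gamma_{r-1,1}^T \theta_{r-1,2} + \sigma_{r,1} \tag{A-7}
$$

The following equation is obtained similarly:

$$
\mathrm{tr}\!\left(\Sigma_{r,2}\Sigma_{r,1}^{-1}\right)
= \mathrm{tr}\!\left(\Sigma_{r-1,2}\Sigma_{r-1,1}^{-1}\right) + \beta_{r,1}\,\alpha_{2|1} \tag{A-8}
$$

where

$$
\frac{1}{\beta_{r,i}} = \sigma_{r,i} - \gamma_{r-1,i}^T\, \Sigma_{r-1,i}^{-1}\, \gamma_{r-1,i}, \qquad
\theta_{r-1,i} = \Sigma_{r-1,i}^{-1}\, \gamma_{r-1,i},
$$

$$
\alpha_{2|1} = \theta_{r-1,1}^T \Sigma_{r-1,2}\, \theta_{r-1,1} - 2\,\gamma_{r-1,2}^T \theta_{r-1,1} + \sigma_{r,2} \tag{A-9}
$$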
APPENDIX B 


RECURSIVE MATRIX RELATIONSHIPS WHEN A FEATURE IS DELETED 


This appendix derives recursive expressions for the inverse of a covariance
matrix, the determinant of a covariance matrix, and the trace of the product
of two matrices when a feature $x_r$ is deleted from the current feature
subset $x_1, x_2, \ldots, x_r$.


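B.1 INVERSE OF A COVARIANCE MATRIX

Let the inverse of the covariance matrix $\Sigma_{r,i}$ of equation (3) be
represented as

$$
\Sigma_{r,i}^{-1} = \begin{bmatrix} \Lambda_{r-1,i} & \lambda_{r-1,i} \\ \lambda_{r-1,i}^T & \nu_{r,i} \end{bmatrix} \tag{B-1}
$$

Since $\Sigma_{r,i}^{-1}$ is the inverse of $\Sigma_{r,i}$, one has

$$
\Sigma_{r,i}^{-1}\, \Sigma_{r,i} = I \tag{B-2}
$$

From equations (3), (B-1), and (B-2), the following are obtained:

$$
\Lambda_{r-1,i}\, \Sigma_{r-1,i} + \lambda_{r-1,i}\, \gamma_{r-1,i}^T = I \tag{B-3}
$$

$$
\Lambda_{r-1,i}\, \gamma_{r-1,i} + \sigma_{r,i}\, \lambda_{r-1,i} = 0 \tag{B-4}
$$

$$
\lambda_{r-1,i}^T\, \Sigma_{r-1,i} + \nu_{r,i}\, \gamma_{r-1,i}^T = 0 \tag{B-5}
$$

$$
\lambda_{r-1,i}^T\, \gamma_{r-1,i} + \nu_{r,i}\, \sigma_{r,i} = 1 \tag{B-6}
$$

Noting that $\lambda_{r-1,i}\, \gamma_{r-1,i}^T$ is a matrix of unit rank,
one gets (ref. 15), from equation (B-3),

$$
\Sigma_{r-1,i}^{-1} = \Lambda_{r-1,i} + \frac{\lambda_{r-1,i}\, \gamma_{r-1,i}^T\, \Lambda_{r-1,i}}{1 - \gamma_{r-1,i}^T\, \lambda_{r-1,i}} \tag{B-7}
$$

Using equations (B-4) and (B-6) in (B-7) yields

$$
\Sigma_{r-1,i}^{-1} = \Lambda_{r-1,i} - \frac{\lambda_{r-1,i}\, \lambda_{r-1,i}^T}{\nu_{r,i}} \tag{B-8}
$$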


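B.2 DETERMINANT OF A COVARIANCE MATRIX

From equation (A-5), $\det(\Sigma_{r,i})$ and $\det(\Sigma_{r-1,i})$ are
related as

$$
\det(\Sigma_{r-1,i}) = \nu_{r,i}\, \det(\Sigma_{r,i}) \tag{B-9}
$$

since comparison of equations (A-1) and (B-1) gives
$\nu_{r,i} = \beta_{r,i}$.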
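The deletion relations for the inverse and the determinant can be checked in the same way. The sketch below assumes NumPy and a hypothetical covariance matrix; it recovers the inverse and determinant for the first r - 1 features from the blocks of the full inverse.

```python
import numpy as np

# Hypothetical covariance matrix for r = 3 features; the last feature is deleted.
S = np.array([[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]])
S_inv = np.linalg.inv(S)                      # inverse known for r features

Lam = S_inv[:-1, :-1]                         # leading block of the inverse
lam = S_inv[:-1, -1]                          # last column without its final entry
nu = S_inv[-1, -1]                            # last diagonal element

reduced_inv = Lam - np.outer(lam, lam) / nu   # inverse after deleting the feature
reduced_det = nu * np.linalg.det(S)           # determinant after deleting the feature
```

A rank-one correction of the leading block is all that is required, so no new matrix inversion is performed when a feature is dropped.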

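B.3 TRACE OF THE PRODUCT OF TWO MATRICES

From equations (3), (B-1), and (B-8), one obtains

$$
\mathrm{tr}\!\left(\Sigma_{r,1}\Sigma_{r,2}^{-1}\right)
= \mathrm{tr}\!\left[\Sigma_{r-1,1}\Sigma_{r-1,2}^{-1}\right]
+ \frac{\lambda_{r-1,2}^T\, \Sigma_{r-1,1}\, \lambda_{r-1,2}}{\nu_{r,2}}
+ 2\,\gamma_{r-1,1}^T \lambda_{r-1,2} + \sigma_{r,1}\,\nu_{r,2} \tag{B-10}
$$

Let

$$
\xi_{r,2} = \begin{bmatrix} \lambda_{r-1,2} \\ \nu_{r,2} \end{bmatrix} \tag{B-11}
$$

Consider

$$
\frac{\xi_{r,2}^T\, \Sigma_{r,1}\, \xi_{r,2}}{\nu_{r,2}}
= \frac{1}{\nu_{r,2}} \begin{bmatrix} \lambda_{r-1,2}^T & \nu_{r,2} \end{bmatrix}
\begin{bmatrix} \Sigma_{r-1,1} & \gamma_{r-1,1} \\ \gamma_{r-1,1}^T & \sigma_{r,1} \end{bmatrix}
\begin{bmatrix} \lambda_{r-1,2} \\ \nu_{r,2} \end{bmatrix}
= \frac{\lambda_{r-1,2}^T\, \Sigma_{r-1,1}\, \lambda_{r-1,2}}{\nu_{r,2}}
+ 2\,\gamma_{r-1,1}^T \lambda_{r-1,2} + \sigma_{r,1}\,\nu_{r,2} \tag{B-12}
$$

From equations (B-10) and (B-12), one obtains the required recursive
expression,

$$
\mathrm{tr}\!\left(\Sigma_{r-1,1}\Sigma_{r-1,2}^{-1}\right)
= \mathrm{tr}\!\left(\Sigma_{r,1}\Sigma_{r,2}^{-1}\right) - \frac{\xi_{r,2}^T\, \Sigma_{r,1}\, \xi_{r,2}}{\nu_{r,2}} \tag{B-13}
$$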